ClusterDE

a post-clustering differential expression method

Carson Zhang

July 1, 2024

Introduction

Background

Biology

TODO: give the basic biology background necessary to understand the paper.

RNA

RNA carries the genetic information specified in DNA. There are two main types of RNA.

Non-coding RNA performs some biological function.

Messenger RNA forms a template for protein production (it codes for a protein which performs some biological function).

TODO: explain the jump from RNA to the UMI count matrix.

TODO: define UMI.

scRNA-seq

Double-dipping

TODO: explain the double-dipping problem in differential expression analysis.

Biologists are interested in determining the cell types of the cells in a sample. (TODO: give a concrete example.) A natural way to do this might look like this.

  • Cluster the sample. We hope that each cluster represents a distinct cell type.
  • For each gene, perform a hypothesis test for differential expression between the two clusters, and report significant genes as candidate cell-type markers. The problem: the clusters were defined from the same data used for testing, so the p-values are invalid. This is double-dipping.

Method

In words, the ClusterDE method can be broken up into four steps:

  1. Generate a synthetic null dataset that mimics the structure (in particular, the gene-gene correlation structure) of the original data.

  2. Separately partition the synthetic null data and the target data (real data) into two clusters.

  3. Separately for the null and target data, perform hypothesis tests for differentially expressed genes between the two clusters. For each gene, compute some sort of difference between the scores on the two datasets.

  4. Output a subset of the significant results from step 3 as potential cell-type marker genes.

Method: graphic

A graphical illustration of ClusterDE.

1. Synthetic null generation

The synthetic null generation consists of three steps, as described in the following figure.

Null generation steps.

Synthetic null generation

1. Model the null distribution in terms of the Gaussian copula.

Sklar’s Theorem

Theorem (Sklar’s Theorem): Let \(\mathbf{X}\) be an \(m\)-dimensional random vector with joint cumulative distribution function \(F\) and marginal distribution functions \(F_j, j = 1, ..., m\). The joint CDF can be expressed as

\[ F(x_1, ..., x_m) = C(F_1(x_1), ..., F_m(x_m)) \]

with associated probability density (or mass) function

\[ f(x_1, ..., x_m) = c(F_1(x_1), ..., F_m(x_m)) f_1(x_1) ... f_m(x_m) \]

for an \(m\)-dimensional copula \(C\) with copula density \(c\).

Sklar’s Theorem (cont.)

The inverse also holds: the copula corresponding to a multivariate CDF \(F\) with marginal distribution functions \(F_j, j = 1, ..., m\) can be expressed as

\[ C(u_1, ..., u_m) = F(F_1^{-1}(u_1), ..., F_m^{-1}(u_m)) \] and the copula density (or mass) function is

\[ c(u_1,...,u_m) = \frac{f(F_1^{-1}(u_1), ..., F_m^{-1}(u_m))} {f_1(F_1^{-1}(u_1)) ... f_m(F_m^{-1}(u_m))}. \]
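As a quick numerical check of this identity, the snippet below (Python, illustrative only) evaluates the Gaussian copula density two ways: via Sklar's ratio of the joint density to the product of marginal densities, and via the closed-form expression \(c(\mathbf{u}) = |\mathbf{R}|^{-1/2}\exp(-\frac{1}{2}\mathbf{z}^\top(\mathbf{R}^{-1} - \mathbf{I})\mathbf{z})\) with \(z_j = \Phi^{-1}(u_j)\). The values of \(\rho\) and \(\mathbf{u}\) are arbitrary.

```python
import numpy as np
from scipy import stats

# Illustrative bivariate Gaussian copula with latent correlation rho
rho = 0.5
R = np.array([[1.0, rho], [rho, 1.0]])
u = np.array([0.3, 0.7])
z = stats.norm.ppf(u)  # z_j = Phi^{-1}(u_j)

# Sklar's ratio: joint density over the product of marginal densities
c_sklar = stats.multivariate_normal(mean=[0.0, 0.0], cov=R).pdf(z) / (
    stats.norm.pdf(z[0]) * stats.norm.pdf(z[1])
)

# Closed-form Gaussian copula density for comparison
c_closed = np.exp(-0.5 * z @ (np.linalg.inv(R) - np.eye(2)) @ z) / np.sqrt(
    np.linalg.det(R)
)

assert np.isclose(c_sklar, c_closed)
```

The two quantities agree, which is exactly the content of the density form of Sklar's Theorem for the Gaussian case.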

Why Sklar’s Theorem

We choose \(C\), the specific copula function that we will use to approximate \(F\). For convenience, we choose \(C\) to be the Gaussian copula, i.e. the copula of a multivariate Gaussian distribution: something we know how to sample from. Therefore, we have found a way to express the joint distribution of genes in terms of a multivariate Gaussian.

Now, our goal is to estimate the marginal parameters \(\{\mu_j, \sigma_j\}_{j = 1}^m\) and the copula correlation matrix \(\mathbf{R}\).

2. Fit the null model to the real data.

Recall that the power of the copula is that it allows us to consider the marginal distributions separately from the covariance structure. Therefore, we can proceed as follows.

For each gene \(j\), estimate \(\{\mu_j, \sigma_j\}\) using maximum likelihood. These estimates determine the marginal distributions \(F_j\).
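A sketch of this per-gene fit in Python, assuming negative binomial marginals parameterized by size and success probability (the simulated counts and starting values are illustrative; the paper's actual implementation may differ):

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(0)
counts = rng.negative_binomial(5, 0.3, size=2000)  # simulated counts for one gene

def neg_log_lik(params):
    # Unconstrained parameterization: size = exp(log_size), p = sigmoid(logit_p)
    log_size, logit_p = params
    size = np.exp(log_size)
    p = 1.0 / (1.0 + np.exp(-logit_p))
    return -stats.nbinom.logpmf(counts, size, p).sum()

# Method-of-moments starting values, then numerical maximum likelihood
m_bar, v = counts.mean(), counts.var()
size0 = m_bar**2 / max(v - m_bar, 1e-8)
p0 = size0 / (size0 + m_bar)
res = optimize.minimize(neg_log_lik, x0=[np.log(size0), np.log(p0 / (1 - p0))],
                        method="Nelder-Mead")
size_hat = np.exp(res.x[0])
p_hat = 1.0 / (1.0 + np.exp(-res.x[1]))
mean_hat = size_hat * (1 - p_hat) / p_hat  # fitted NB mean
```

The unconstrained reparameterization keeps the optimizer away from invalid parameter values; the fitted mean should closely match the sample mean.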

2. Model the dependence structure.

For the entire dataset, use the Gaussian copula to model the dependence structure.

  • Transform the raw data (counts) to the CDF values of the counts. (TODO: determine whether we use the empirical CDF.) Denote this as \(u_j := F_j(y_j)\). By the probability integral transform, this random variable is approximately \(U(0, 1)\) (exactly uniform only when \(F_j\) is continuous; counts are discrete).
  • Transform the CDF values to standard Gaussian random variables. That is, compute \(z_j := \Phi^{-1}(F_j(y_j))\).
  • Fit an \(m\)-dimensional multivariate Gaussian distribution to this data to estimate the correlation matrix \(\mathbf{\hat{R}}\).
  • Sample from this multivariate Gaussian distribution: \(N_m(\mathbf{0}, \mathbf{\hat{R}})\).
  • Transform the multivariate Gaussian sample to its CDF values.
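The transformation chain above can be sketched as follows (Python; the correlated Poisson toy data stands in for a real UMI matrix, and the rank-based empirical CDF is scaled by \(n + 1\) to keep values strictly inside \((0, 1)\) before applying \(\Phi^{-1}\)):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_cells, m_genes = 1000, 3
R_true = np.array([[1.0, 0.6, 0.3], [0.6, 1.0, 0.2], [0.3, 0.2, 1.0]])

# Toy correlated counts via a latent Gaussian (stands in for a real UMI matrix)
z = rng.multivariate_normal(np.zeros(m_genes), R_true, size=n_cells)
counts = stats.poisson.ppf(stats.norm.cdf(z), mu=4.0)

# Empirical CDF values, scaled by n + 1 so they stay strictly inside (0, 1)
u = stats.rankdata(counts, axis=0) / (n_cells + 1)
z_hat = stats.norm.ppf(u)                    # uniforms -> standard Gaussians
R_hat = np.corrcoef(z_hat, rowvar=False)     # estimated copula correlation
```

Ties in the discrete counts attenuate the estimate slightly, so \(\mathbf{\hat{R}}\) recovers the latent correlations only approximately.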

3. Sample from the fitted null model.

  • Generate a sample of size \(n\) from \(N_m(\mathbf{0}, \mathbf{\hat{R}})\).
  • Convert each Gaussian vector to counts by applying \(F_j^{-1}(\Phi(\cdot))\) coordinate-wise, yielding negative binomial count vectors.
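A minimal sketch of the sampling step, assuming the correlation matrix \(\mathbf{\hat{R}}\) and negative binomial marginal parameters have already been fitted (the numbers here are hypothetical, for illustration only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
m = 3
R_hat = np.array([[1.0, 0.6, 0.3], [0.6, 1.0, 0.2], [0.3, 0.2, 1.0]])
# Hypothetical fitted NB marginals (size_j, p_j), one pair per gene
nb_size = np.array([5.0, 2.0, 8.0])
nb_p = np.array([0.3, 0.5, 0.4])

z = rng.multivariate_normal(np.zeros(m), R_hat, size=1000)  # Gaussian sample
u = stats.norm.cdf(z)                                       # -> uniforms
null_counts = stats.nbinom.ppf(u, nb_size, nb_p)            # -> NB count vectors
```

The resulting matrix has the fitted marginal distributions gene by gene, while the Gaussian copula preserves the gene-gene dependence structure.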

2. Clustering

ClusterDE allows any clustering algorithm. Note that it handles only the case of two clusters, so if you start with more clusters, you should identify a particular pair of interest. In the Practical guidelines for ClusterDE usage subsection, steps 1 and 2 describe how an analyst should proceed.

  1. Given \(\geq 2\) clusters, identify 2 clusters of interest. Generally, this will be a pair for which you suspect the clustering is spurious (i.e. you think the two clusters actually come from the same cell type, so they are strong candidates to be merged into a single cluster).

  2. Filter the data so that you only consider the subset of cells that come from those two clusters.
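For illustration, here is the two-cluster step with k-means as a stand-in for the default Seurat/Louvain pipeline (ClusterDE allows any clustering method; the random matrices stand in for log-normalized expression data):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(3)
# One homogeneous blob: any 2-cluster split of it is an artifact of the
# algorithm, which is exactly the situation ClusterDE is designed to detect.
target = rng.normal(size=(200, 10))
null = rng.normal(size=(200, 10))  # synthetic null data, clustered separately

_, target_labels = kmeans2(target, 2, minit="++", seed=0)
_, null_labels = kmeans2(null, 2, minit="++", seed=0)
```

The key point is that the identical clustering procedure is applied, separately, to the target data and to the synthetic null data.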

TODO: describe the Seurat clustering pipeline.

UMAP

UMAP is a common dimensionality-reduction technique for visualizing scRNA-seq data.

TODO: summarize UMAP.

Louvain

The example analyses in the presentation use the default Seurat clustering procedure, which uses the Louvain algorithm.

TODO: summarize the Louvain algorithm.

3. DE analysis (testing)

ClusterDE allows any DE test.

TODO: choose and summarize common DE tests.

Let \(P_1, ..., P_m\) be the p-values computed by the \(m\) DE tests on the target data. Define the target DE score \(S_j := -\log_{10}P_j\). Likewise, define the null DE scores \(\tilde{S}_j\) from the p-values computed on the synthetic null data.

TODO: motivate the formula for the DE score. It is essentially the information content of the p-value.

The final outputs of step 3: \(m\) target DE scores \(S_1, ..., S_m\); \(m\) null DE scores \(\tilde{S}_1, ..., \tilde{S}_m\).
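As a concrete sketch, the snippet below computes target DE scores with a per-gene Wilcoxon rank-sum test (one common choice; ClusterDE allows any DE test). The simulated data are illustrative; the same code would be run on the synthetic null data to obtain \(\tilde{S}_1, ..., \tilde{S}_m\).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, m = 100, 5
labels = np.repeat([0, 1], 50)                  # the two clusters
target = rng.poisson(5.0, size=(n, m)).astype(float)
target[labels == 0, 0] += 4.0                   # gene 0 is truly DE

# Per-gene Wilcoxon rank-sum test between the two clusters
pvals = np.array([
    stats.mannwhitneyu(target[labels == 0, j], target[labels == 1, j],
                       alternative="two-sided").pvalue
    for j in range(m)
])
S = -np.log10(pvals)  # target DE scores S_1, ..., S_m
```

The truly DE gene gets a tiny p-value and hence a large score, while the null genes' scores stay small.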

4. FDR control

Given the target and null DE scores, compute a contrast score for gene \(j\) as \(C_j := S_j - \tilde{S}_j\).

In practice, we will see somewhat more positive contrast scores than negative ones.

In the case that there truly are two cell types, then we should definitely expect some true DE genes, and should therefore expect more positive contrast scores than negative ones.

Recall the contrast score formula: \(C_j := S_j - \tilde{S}_j\). When \(C_j\) is positive, the DE test on the target data produced a smaller p-value than on the null data, i.e. the target result was more surprising under the null hypothesis. This is what we expect when the null is false. This is why the authors say that, ideally, slightly less than 50% of all genes’ contrast scores should be negative: close to half, but slightly under, since for some genes the null is false.

We want to satisfy the symmetry requirement of Clipper: for null genes, the contrast scores should be distributed symmetrically about zero. This is analogous to the assumption that p-values are uniform under the null in standard hypothesis tests.

Motivation for the cutoff: under the null hypothesis, we have a symmetric distribution of contrast scores. Assume the negative tail is differences due to random noise. Under the null hypothesis, the right tail would be the same size.

In real data, the right tail should be bigger (we hope), since true differentially expressed genes produce large positive contrast scores; any instances where the null is violated will appear only in the right tail.

We declare gene \(j\) a discovery when \(C_j \geq t\), where the FDR at cutoff \(t\) is estimated (in the Barber–Candès style used by Clipper) as

\[ \widehat{\mathrm{FDR}}(t) = \frac{\#\{j : C_j \leq -t\} + 1}{\#\{j : C_j \geq t\} \vee 1} . \]

We choose the minimum \(t\) such that \(\widehat{\mathrm{FDR}}(t) \leq q\): this maximizes our discoveries at the target FDR level \(q\).
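A sketch of this thresholding rule on simulated contrast scores (the score distributions are made up; the FDR estimate is the Barber–Candès-style formula used by Clipper):

```python
import numpy as np

rng = np.random.default_rng(5)
m = 2000
S_target = rng.exponential(0.5, size=m)   # made-up target DE scores
S_null = rng.exponential(0.5, size=m)     # made-up null DE scores
S_target[:50] += 6.0                      # 50 genes carry a real signal
C = S_target - S_null                     # contrast scores

q = 0.05
t_star = np.inf
# Smallest t whose estimated FDR drops below the target level q
for t in np.sort(np.abs(C)):
    fdr_hat = (np.sum(C <= -t) + 1) / max(np.sum(C >= t), 1)
    if fdr_hat <= q:
        t_star = t
        break
discoveries = np.flatnonzero(C >= t_star)
```

Because the null contrast scores are symmetric about zero, the size of the negative tail estimates the number of false positives in the positive tail, which is what makes the FDR estimate valid.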

Results

Simulation

Real data example

Appendix